Topic models are easy to train, but do they generate useful topics? In this post, we discuss how to calculate several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach: it is a statistical technique that re-expresses highly correlated multivariate data as uncorrelated components, each capturing an independent piece of the information in the larger data set.
To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19 (Tweets were pooled based on hashtag). We use Python to calculate diagnostic measures from Mallet topic-term frequency output files. Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) topic frequency, and 2) top-term specificity. Furthermore, on average, we found common topics with specific top terms score significantly better on coherence than uncommon topics with unspecific top terms. However, we also found many poor topics score relatively high on coherence. In other words, our results suggest topics that capture the specific, main topics of a corpus should be easier to interpret, but the interpretability of a topic doesn’t imply the topic is specific or central to a corpus.
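As a rough illustration of how such diagnostics can be computed from topic-term output, here is a minimal sketch of the coherence measure of Mimno et al. (2011), which Mallet uses. It assumes each document has been reduced to its set of terms; the function name and toy inputs are ours, not Mallet's file format.

```python
import math

def topic_coherence(top_words, docs):
    """Mimno et al. (2011) coherence for one topic.

    top_words: the topic's top terms, ordered by in-topic frequency
               (most frequent first).
    docs: list of sets, each holding the distinct terms of one document.
    """
    def doc_freq(w):
        # Number of documents containing the word w
        return sum(1 for d in docs if w in d)

    def co_doc_freq(w1, w2):
        # Number of documents containing both words
        return sum(1 for d in docs if w1 in d and w2 in d)

    # Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top terms;
    # scores are <= 0, and values closer to zero indicate higher coherence.
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            score += math.log(
                (co_doc_freq(top_words[m], top_words[l]) + 1)
                / doc_freq(top_words[l])
            )
    return score
```

Because the metric counts co-document frequencies of a topic's top terms, topics whose top terms rarely appear together in the same posts are penalized.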
First, we compare the evaluation metrics associated with each of the data sets. The density plots below show there are clear differences between the diagnostic measures for Facebook and Twitter topics. For example, the Facebook topics have much better coherence scores, which implies the top terms of each topic co-occur more often in Facebook posts. Likewise, Facebook topics have a larger tokens-per-topic count, which indicates the topic terms occur more frequently in the Facebook corpus.
Prior to performing PCA, we must standardize the data so the metrics share a common scale: each measure is rescaled to have a mean of zero and a standard deviation of one. The density plots below show the distributions of the standardized data.
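The standardization step is ordinary z-scoring; a minimal sketch with NumPy (the function name is illustrative):

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean and divide by its standard deviation,
    so every metric has mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Applied to a topics-by-metrics matrix, this puts token counts (in the thousands) and entropies (small decimals) on the same footing before PCA.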
Correlation analysis of the evaluation metrics shows that we are dealing with highly correlated multivariate data. There is a strong negative correlation between token count and exclusivity; in other words, exclusive topics don’t occur frequently in the corpus. We also see a negative correlation between average word length and the effective number of words, which implies effective terms tend to be shorter. Corpus entropy has a strong positive correlation with exclusivity, whereas coherence has a strong negative correlation with corpus entropy and exclusivity. In other words, the top terms for coherent topics tend to use words that are common to the corpus, and coherent topics tend to occur more frequently.
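For readers following along, a correlation matrix like this can be computed with `numpy.corrcoef`. The data below are synthetic stand-ins constructed to exhibit a strong negative correlation, in the spirit of the token count/exclusivity relationship; they are not our actual metrics.

```python
import numpy as np

# Synthetic stand-ins: two metrics over 100 hypothetical topics,
# built so the second is (noisily) the negative of the first.
rng = np.random.default_rng(0)
tokens = rng.normal(size=100)
exclusivity = -tokens + rng.normal(scale=0.3, size=100)

# Columns are variables, rows are observations (topics)
corr = np.corrcoef(np.column_stack([tokens, exclusivity]), rowvar=False)
```

On real data, the same call over the full topics-by-metrics matrix yields the correlation matrix that PCA is then performed on.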
A scree plot of the eigenvalues of the correlation matrix suggests we should retain two principal components (PCs). The general rule of thumb is to keep PCs that are “one less than the elbow” of the scree plot or PCs with an eigenvalue of 1 or greater.
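The eigenvalue-of-1 rule of thumb (the Kaiser criterion) is straightforward to apply in code; a sketch with an illustrative function name:

```python
import numpy as np

def retained_components(corr_matrix):
    """Eigenvalues of the correlation matrix (descending) and the number of
    principal components retained under the Kaiser rule (eigenvalue >= 1)."""
    eigvals = np.linalg.eigvalsh(corr_matrix)[::-1]  # eigvalsh returns ascending
    return eigvals, int((eigvals >= 1).sum())
```

Plotting the returned eigenvalues in order gives the scree plot itself; the count gives the Kaiser-rule answer to compare against the "one less than the elbow" reading.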
The loading matrix below shows token count, corpus entropy, exclusivity, and coherence contribute the most to the 1st PC, which explains 44.6% of the variance based on the eigenanalysis (2.679/6 ≈ 44.6%). Uniform entropy and the effective number of words contribute most to the 2nd PC, which explains 31.5% of the variance. Coherence contributes most to the 3rd PC, which explains 11.4% of the variance.
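A sketch of how the loadings and explained-variance percentages fall out of an eigenanalysis of the correlation matrix (standardized data assumed; function and variable names are ours):

```python
import numpy as np

def pca_loadings(Z):
    """Eigenvectors (loadings) and explained-variance ratios for a
    standardized topics-by-metrics matrix Z."""
    R = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)       # ascending order
    order = np.argsort(eigvals)[::-1]          # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eigenvalues of a correlation matrix sum to the number of metrics,
    # so each PC's share of variance is its eigenvalue over that total.
    explained = eigvals / eigvals.sum()
    return eigvecs, explained
```

With six metrics, an eigenvalue of 2.679 corresponds to 2.679/6 ≈ 44.6% of the variance, matching the figure quoted above.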
Based on the loading plots below, the 1st PC appears to capture topic frequency. Exclusivity and corpus entropy are positioned on the far right and imply a topic does not appear often in the corpus; token count is positioned on the far left and implies a topic appears often in the corpus. The 2nd PC appears to capture top-term frequency. The number of effective words is positioned at the bottom and implies the top terms in a topic are top terms in many topics. Uniform entropy is located at the top and indicates a topic distribution (as well as the top terms of that distribution) does not add much information compared to a uniform distribution.
Interestingly, we did not identify coherence as a key variable throughout the PCA or subsequent interpretation of the results. However, an interesting pattern emerges in the score plots of the 1st and 2nd PCs when points are sized based on coherence values; namely, more coherent topics (i.e., the points with a smaller diameter) have relatively large positive scores on the 1st and 2nd PCs. This implies that coherent topics may be more representative of the corpus since they appear more often, and the top terms of coherent topics tend to appear in many topics. Stated differently, coherence appears to capture elements of the two primary variables: topic frequency and top-term frequency.
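Computing the PC scores used in such a plot is a single projection; a minimal sketch, assuming `Z` is the standardized topics-by-metrics matrix and `eigvecs` the eigenvector (loading) matrix from the eigenanalysis:

```python
import numpy as np

def pc_scores(Z, eigvecs, k=2):
    """Project standardized data onto the first k principal components;
    each row of the result is one topic's (PC1, ..., PCk) score."""
    return np.asarray(Z) @ np.asarray(eigvecs)[:, :k]
```

The resulting two columns are the x and y coordinates of the score plot; sizing each point by the topic's coherence value then reveals the pattern described above.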
All of the measures match our interpretation of the PCs. However, coherence is a bit trickier: good topics score well on coherence, but as noted above, many poor topics do as well.